Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: shutdown race resilience #141

Merged
merged 1 commit into from
Oct 7, 2024

Conversation

stevenh
Copy link
Contributor

@stevenh stevenh commented Jul 27, 2024

A significant rewrite to ensure that we don't suffer from shutdown race conditions as the prune condition is met and additional resources are being created.

Previously this would remove resources that were still in use, now we retry if we detect new resources have been created within a window of the prune condition triggering.

This supports the following new environment configuration settings:

  • RYUK_REMOVE_RETRIES - The number of times to retry removing a resource.
  • RYUK_REQUEST_TIMEOUT - The timeout for any Docker requests.
  • RYUK_RETRY_OFFSET - The offset added to the start time of the prune pass that is used as the minimum resource creation time

@stevenh stevenh requested a review from a team as a code owner July 27, 2024 20:28
@stevenh stevenh force-pushed the feat/resilience branch 6 times, most recently from c1b9be1 to 6137611 Compare July 27, 2024 21:24
@stevenh stevenh marked this pull request as draft July 27, 2024 21:44
@stevenh stevenh force-pushed the feat/resilience branch 10 times, most recently from f043d84 to 5f8c4ef Compare July 28, 2024 15:31
@stevenh stevenh marked this pull request as ready for review July 28, 2024 15:37
@stevenh stevenh force-pushed the feat/resilience branch 2 times, most recently from e7df41a to c8045ad Compare July 30, 2024 14:40
@mdelapenya
Copy link
Member

For some reason, I need to set up the loadConfig tests with RYUK_VERBOSE=false in order to make the tests pass locally:

tests := map[string]struct {
		setEnv   func(*testing.T)
		expected config
	}{
		"defaults": {
+			setEnv: func(t *testing.T) {
+				t.Helper()
+				t.Setenv("RYUK_VERBOSE", "false")
+			},
			expected: config{
				ConnectionTimeout:   time.Minute,
				Port:                8080,
				ReconnectionTimeout: time.Second * 10,
				RemoveRetries:       10,
				RequestTimeout:      time.Second * 10,
				RetryOffset:         -time.Second,
				ShutdownTimeout:     time.Minute * 10,
			},
		},
		"custom": {
			setEnv: func(t *testing.T) {
				t.Helper()
				t.Setenv("RYUK_PORT", "1234")
				t.Setenv("RYUK_CONNECTION_TIMEOUT", "2s")
				t.Setenv("RYUK_RECONNECTION_TIMEOUT", "3s")
				t.Setenv("RYUK_REQUEST_TIMEOUT", "4s")
				t.Setenv("RYUK_REMOVE_RETRIES", "5")
				t.Setenv("RYUK_RETRY_OFFSET", "-6s")
				t.Setenv("RYUK_SHUTDOWN_TIMEOUT", "7s")
+				t.Setenv("RYUK_VERBOSE", "false")
			},
			expected: config{
				Port:                1234,
				ConnectionTimeout:   time.Second * 2,
				ReconnectionTimeout: time.Second * 3,
				RequestTimeout:      time.Second * 4,
				RemoveRetries:       5,
				RetryOffset:         -time.Second * 6,
				ShutdownTimeout:     time.Second * 7,
			},
		},
	}

A bug in the env library for default booleans?

Copy link
Member

@mdelapenya mdelapenya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for opening this @stevenh. I see a lot of changes here, which I'd suggest splitting in different PRs:

  • config refactor
  • run function

Ryuk is a critical piece in the Testcontainers ecosystem, so updating it should be done very carefully to not affect all the related projects.

Would you mind separating this big PR into multiple ones? I can help you out with it if needed.

Thanks!

@stevenh
Copy link
Contributor Author

stevenh commented Aug 21, 2024

For some reason, I need to set up the loadConfig tests with RYUK_VERBOSE=false in order to make the tests pass locally:

A bug in the env library for default booleans?

Interesting what's the error you get?

@stevenh stevenh force-pushed the feat/resilience branch 2 times, most recently from 2169978 to 654f1a6 Compare September 3, 2024 12:56
@stevenh
Copy link
Contributor Author

stevenh commented Sep 3, 2024

For some reason, I need to set up the loadConfig tests with RYUK_VERBOSE=false in order to make the tests pass locally:

snip...

A bug in the env library for default booleans?

If you had environment vars set then the test would have failed as it was expecting a clean environment.

I've updated the test to clean the current environment before it starts, which could fix the issue you saw.

@mdelapenya mdelapenya self-assigned this Sep 3, 2024
@mdelapenya
Copy link
Member

Hi @stevenh, thanks for this PR, I've started the review and I see that there are some changes that would deserve its own PR:

  • the addition of a library to read the env vars. Is there any other library you considered for that? I like this one is simple and with zero deps.
  • the bumps for the different Go versions and tools (golangci-lint)
  • the real refactor

Could you separate them into three? I can help out with the two first so you can focus on the real meat here

@stevenh stevenh mentioned this pull request Sep 7, 2024
@stevenh stevenh force-pushed the feat/resilience branch 2 times, most recently from c8d5692 to f894119 Compare September 7, 2024 10:02
@stevenh
Copy link
Contributor Author

stevenh commented Sep 7, 2024

Ok so little more involved due to go bump needed new linter so we need to merge in the following order:

  1. chore: bump versions #160
  2. chore: simply config #159 -- Also includes version bump so it complies
  3. feat: shutdown race resilience #141 -- Will need a rebase after the config merge
  4. ci: enhanced linting #158 -- Currently erroring, but no point fixing

I didn't consider other env config libraries, as I've used that one a bunch of times and has worked well, but if there is something better happy to swap.

@mdelapenya
Copy link
Member

Given #159 has been merged, could you please rebase this one? 🙏

@stevenh stevenh force-pushed the feat/resilience branch 3 times, most recently from a87ce57 to 6c0b5b2 Compare September 30, 2024 20:22
mdelapenya
mdelapenya previously approved these changes Oct 2, 2024
Copy link
Member

@mdelapenya mdelapenya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

I like the changes about the vocabulary: I'll miss the ryuk names, although I admit that using the names and words in this PR is now more inclusive and more maintainable for the long term, so thanks for this.

Once merged, as we are building the images for the commit, we can immediately test this image in testcontainers-go, using the commit-sha as docker tag (see https://hub.docker.com/r/testcontainers/ryuk/tags)

README.md Show resolved Hide resolved
@stevenh
Copy link
Contributor Author

stevenh commented Oct 3, 2024

@mdelapenya added some docker cli based tests for container, network, image and volume reaping so should be good to go.

Copy link
Member

@mdelapenya mdelapenya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One comment regarding shelling out to the docker binary.

Other than that, will take the final view this weekend for final approval 🙏 I'm doing conferences next week, so I'll try to do before that.

reaper_test.go Outdated Show resolved Hide resolved
@stevenh stevenh force-pushed the feat/resilience branch 2 times, most recently from 40b4aed to b08d2e0 Compare October 4, 2024 12:58
A significant rewrite to ensure that we don't suffer from shutdown race
conditions as the prune condition is met and additional resources are
being created.

Previously this would remove resources that were still in use, now we
retry if we detect new resources have been created within a window of
the prune condition triggering.

This supports the following new environment configuration settings:
- RYUK_REMOVE_RETRIES - The number of times to retry removing a resource.
- RYUK_REQUEST_TIMEOUT - The timeout for any Docker requests.
- RYUK_RETRY_OFFSET - The offset added to the start time of the prune
  pass that is used as the minimum resource creation time.
- RYUK_SHUTDOWN_TIMEOUT - The duration after shutdown has been requested
  when the remaining connections are ignored and prune checks start.

Update README to correct example, as health is only valid for containers
not the other resources, so would cause failures.

Enable race detection on CI tests.
@mdelapenya mdelapenya added the enhancement New feature or request label Oct 7, 2024
Copy link
Member

@mdelapenya mdelapenya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, great PR, thanks for adding clarity in the reap process.

Cheers!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants